Register Renamer

It is legal for the renamer to use physical register number zero. Architectural register number zero, however, is not assigned a physical register: it is always bypassed to zero and hence does not need a physical register number.

Physical target registers are assigned by Qupls\_reg\_renamer during the rename stage of the pipeline. Target registers for architectural register zero are not assigned.

I have forgotten the exact heuristic for the number of physical registers that should be present. It must be significantly more than the number of architectural registers. Since this architecture has a lot of registers, that means a large number of physical registers. Fortunately, block RAM is used to implement the register file and can provide up to 512 physical registers. That many registers are not needed, so the design is restricted to 256 physical registers. This is about 3.5 times the number of architectural registers, which should be plenty. The 256 registers not used in the block RAM are reserved for future usage.

The renamer uses four FIFOs that can each contain 64 rename register tags. The number of tags supported by all four FIFOs is thus 256, matching the number of physical registers. Each FIFO may be used to allocate a target physical register every clock cycle. Therefore, up to four target registers may be assigned per clock cycle.
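The allocation scheme described above can be sketched as a small behavioural model. This is only an illustration, not the Qupls\_reg\_renamer logic itself: the class name `RenamerModel`, the initial distribution of tags across the FIFOs, and the rule that a freed tag returns to the FIFO it came from are all assumptions made for the example.

```python
from collections import deque

NUM_FIFOS = 4        # from the text: four FIFOs
TAGS_PER_FIFO = 64   # from the text: 64 tags per FIFO


class RenamerModel:
    """Behavioural sketch of a four-FIFO free list (hypothetical model)."""

    def __init__(self):
        # Assumption: FIFO i initially holds tags i*64 .. i*64+63.
        self.fifos = [
            deque(range(i * TAGS_PER_FIFO, (i + 1) * TAGS_PER_FIFO))
            for i in range(NUM_FIFOS)
        ]

    def allocate(self, requests):
        """Allocate target tags for one clock cycle.

        requests[i] is True when rename slot i needs a target register
        (it would be False for architectural register zero).
        Returns one entry per slot: a tag, or None if nothing was allocated.
        """
        tags = []
        for fifo, wanted in zip(self.fifos, requests):
            if wanted and fifo:
                tags.append(fifo.popleft())
            else:
                tags.append(None)
        return tags

    def free(self, tag):
        # Assumption: a freed tag goes back to the FIFO it was drawn from.
        self.fifos[tag // TAGS_PER_FIFO].append(tag)


renamer = RenamerModel()
print(renamer.allocate([True, True, True, True]))
```

Because each FIFO serves one rename slot, four tags can be handed out in the same cycle without any arbitration between slots.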

Alternate Register Renamer – low resource usage

* Circular buffer of register tags implemented with FPGA SRLs.

Register File

There are 512 registers required to support the full ISA register set including vector registers. Because of the large number of registers, the register set is implemented in block RAMs. To get the required depth of approximately 1536 registers, three block RAMs are needed per register port. Approximately 192 block RAMs are needed to support 16 read ports and 4 write ports (16x4x3). The demo version uses only two write ports and 32 vector registers so only about 48 block RAMs are needed.
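The block RAM estimate for the full version follows directly from the product quoted above. The short calculation below just reproduces that arithmetic; the function name `block_ram_count` is made up for the example, and the demo figure is not derived here because the text does not state every parameter it depends on.

```python
def block_ram_count(read_ports, write_ports, brams_per_port):
    """Block RAMs needed, using the read x write x depth product from the text."""
    return read_ports * write_ports * brams_per_port


# Full version: 16 read ports, 4 write ports, 3 block RAMs of depth per port.
print(block_ram_count(read_ports=16, write_ports=4, brams_per_port=3))
```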

Physical Register zero is bypassed to the value zero.

The stack pointer register is banked depending on the operating mode or interrupt level. This is easily accomplished by adding the operating mode to the specified register. To be a little more efficient an ‘or’ operation is used instead of an add.

| Vector Regno | Regno | Usage |
| --- | --- | --- |
| 0 | 0 | Always written as zero, thus reads as zero |
| 0 to 3 | 1 to 30 | Programming model visible registers |
| 3 | 31 | Alias for registers 40 to 43, looks like the stack pointer |
| 4 | 32 | Machine stack pointer |
| 4 | 33 to 39 | Interrupt stack pointers |
| 5 | 40 | Application / User stack pointer |
| 5 | 41 | Supervisor stack pointer |
| 5 | 42 | Hypervisor stack pointer |
| 5 | 43 | Safe stack pointer |
| 5 | 44 | Micro code temporary |
| 5 | 45 | Micro code temporary |
| 5 | 46 | Micro code temporary |
| 5 | 47 | Micro code temporary |
| 6 | 48 to 255 | 26 vector registers |
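The stack pointer banking can be illustrated with a few lines of code. This is a sketch under assumptions: the mode values 0 to 3 are taken to select registers 40 to 43 in table order, which is a hypothetical encoding, and the function name `banked_regno` is invented for the example. The interrupt stack pointers (33 to 39) are not modelled.

```python
SP_ALIAS = 31      # from the table: alias that looks like the stack pointer
SP_BANK_BASE = 40  # from the table: first of the banked stack pointers 40..43


def banked_regno(regno, mode):
    """Map the stack pointer alias onto a banked register.

    mode is assumed to be a 2-bit value 0..3 (hypothetical encoding).
    Because 40 is 0b101000, its low two bits are zero, so OR-ing in the
    mode gives the same result as adding it, without a carry chain.
    """
    if regno == SP_ALIAS:
        return SP_BANK_BASE | (mode & 3)
    return regno


for mode in range(4):
    print(mode, banked_regno(SP_ALIAS, mode))
```

The OR only substitutes for the add because the bank base is aligned to the size of the bank; a base such as 41 would not work.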

An eleven-bit physical register tag is in use as there may be up to 1536 or so registers needed. The last register tag, all ones, is reserved for uninitialized registers. It is possible for the core to have a register as a source register before it has been properly loaded. In that event there would be no physical register assigned for it yet. This is represented with the tag of all ones. Hardware forces this register valid so that the machine does not hang waiting for a register to become valid due to a software issue.
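The reserved tag can be shown with a minimal sketch. The names `UNINIT_TAG` and `source_is_valid`, and the use of a dictionary for the valid bits, are assumptions for illustration rather than names taken from the core.

```python
TAG_BITS = 11                     # from the text: eleven-bit tags
UNINIT_TAG = (1 << TAG_BITS) - 1  # all ones, reserved for uninitialized registers


def source_is_valid(tag, valid_bits):
    """Return True when a source operand may be read.

    valid_bits maps a physical register tag to its valid flag
    (hypothetical representation). The all-ones tag is forced valid
    so that an instruction never waits on a register that was never written.
    """
    if tag == UNINIT_TAG:
        return True
    return valid_bits.get(tag, False)


print(hex(UNINIT_TAG))
```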

Decoder

Several places in the core need to know whether an architectural register is register zero, so the decoder decodes this status into a single bit.

Instruction Extract

The BSR / BRA instruction is trapped in the extract stage and causes an immediate change of the IP. At decode, the BSR / BRA is flagged as done already and thus is never scheduled for execution.

Predicated Execution

Predicated execution of instructions and masking of vector operations is handled using a PRED instruction modifier. The modifier is placed in code before the instructions it applies to. Using the PRED modifier is more code dense than having a predicate register field in every instruction. The PRED modifier shows up only when needed, which is not for most instructions. A single PRED modifier applies for up to eight following instructions. A mask field in the PRED modifier allows instructions to ignore the predicate if the modifier is to be applied to fewer than eight instructions.

The PRED modifier modifies the scheduling of subsequent instructions. Up to eight following instructions may check the predicate status of the PRED instruction.

The PRED modifier is scheduled and executes like any other instruction. It amounts to a bit extract from a register followed by a case statement based on a mask, and it is handled by ALU logic. The PRED modifier writes its result, which is 64 bits wide, to the ROB entry for the PRED instruction. The ROB was selected as the place to store the predicate result because the result is temporary and needed only by the scheduler for subsequent instructions. Scheduling of subsequent instructions checks for a prior PRED modifier. If one is found, the appropriate predication bit is read from the ROB and used to either schedule the instruction on its functional unit (all bits in group have a non-zero value), or schedule the instruction as a copy target on the ALU (all bits in group = 0).

Vector Elements

A vector element is a 64-bit wide slice of a vector register which is treated as a single 64-bit register by the CPU. There are eight 64-bit wide elements to a vector register for a total of 512 bits.

Predicate Groups

When a predicate is applied, each vector element has a predicate byte associated with it. Each bit of the predicate byte is reserved for one or more bytes of the element.

Predicate values are organized into groups of eight bits. Each bit represents a byte in a register, and each predicate byte represents a vector element. Since there are eight elements in a vector register, eight predicate bytes, or 64 bits, are required. If the element contains a 64-bit value, then only the least significant bit of the byte for the element acts as a predicate bit. If the element contains two 32-bit values, then the least significant two bits of the predicate byte for the element act as predicate bits. And so on.

| Lanes in Element | Bits Checked for Predication |
| --- | --- |
| 1x64 bit | 0 |
| 2x32 bit | 0 and 1 |
| 4x16 bit | 0 to 3 |
| 8x8 bit | 0 to 7 |

To set a true predicate for all lanes in a vector, where the lanes are 64-bit elements, the least significant bit of each byte of the predicate value must be set. The predicate value in this case would be 0101010101010101h. If other bits of the predicate are set they will be ignored. So, the predicate register may be loaded with all ones for instance, when lanes are 64-bit elements. The value FFFFFFFFFFFFFFFFh works as well.

Note that one of the set instructions will set the predicate bits appropriately for the size of lanes being compared.

Example two: 16-bit lanes are being used. Predicate value 0F0F0F0F0F0F0F0Fh will enable all lanes. The value FFFFFFFFFFFFFFFFh works as well. To mask off lane zero the value 0F0F0F0F0F0F0F0Eh would be used.
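The predicate values in the two examples above can be derived mechanically from the lane size. The sketch below does that; the function names `all_lanes_predicate` and `mask_off_lane` are invented for illustration and do not correspond to any instruction or signal in the core.

```python
ELEMENTS = 8       # from the text: eight elements per vector register
ELEMENT_BITS = 64  # from the text: each element is 64 bits wide


def all_lanes_predicate(lane_bits):
    """Minimal predicate value that enables every lane of the given width."""
    lanes_per_element = ELEMENT_BITS // lane_bits
    byte_mask = (1 << lanes_per_element) - 1  # low bits of each predicate byte
    value = 0
    for element in range(ELEMENTS):
        value |= byte_mask << (8 * element)   # one predicate byte per element
    return value


def mask_off_lane(pred, element, lane):
    """Clear the predicate bit for one lane within one element."""
    return pred & ~(1 << (8 * element + lane))


for lane_bits in (64, 32, 16, 8):
    print(lane_bits, format(all_lanes_predicate(lane_bits), "016X"))
print(format(mask_off_lane(all_lanes_predicate(16), 0, 0), "016X"))
```

Since bits above the checked ones are ignored, loading all ones enables every lane regardless of lane size; the function returns only the smallest value that does so.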

Sync

For the demo version, to reduce the logic footprint, any sync instruction will cause a pipeline flush. The demo sync does not resolve the before and after fields of the instruction. This guarantees the sync will work, at the cost of performance.

Quad Precision

Since the CPU is a 64-bit machine with 64-bit registers some means must be arrived at to perform 128-bit quad precision operations. The solution used is to perform the operation using register pairs. The pair of registers is specified by a combination of the quad precision instruction and an instruction modifier, QFEXT, dedicated to performing quad precision operations. The modifier supplies registers to hold the upper 64-bits of the quad precision value.

The quad precision operation then borrows an ALU port to act as a holding place for the quad precision value. The scheduler sees the quad precision modifier and schedules it for the ALU. The modifier does not complete its execution until the quad precision operation is complete. The scheduler schedules a quad precision operation as a pair of operations, one on an ALU used for passthrough, and one on the floating-point unit.

ALU Pair Instructions

ALU instructions that require a pair of ALUs are issued to two ALUs at the same time. Both ALUs see the same instruction. However, the ‘C’ register port is used as the target register for the high-order ALU. For instance, the MULW (multiply widening) instruction causes both ALUs to perform the multiply; however, the high-order product bits are written by ALU #1 while the low-order product bits are written by ALU #0.
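The split of a widening multiply across the two ALUs can be sketched as follows. The example assumes unsigned 64-bit operands, which the text does not specify, and the function name `mulw_unsigned` is hypothetical.

```python
MASK64 = (1 << 64) - 1


def mulw_unsigned(a, b):
    """Return (low, high) halves of a 64x64 -> 128-bit unsigned product.

    Both ALUs compute the same product in this model:
    low  is the half written by ALU #0,
    high is the half written by ALU #1 (targeted through the 'C' port).
    """
    product = (a & MASK64) * (b & MASK64)
    low = product & MASK64
    high = product >> 64
    return low, high


low, high = mulw_unsigned(MASK64, 2)
print(format(high, "016X"), format(low, "016X"))
```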

Stomping on Instructions

If the instruction is stomped on before the enqueue stage (rename, etc.), then registers are not renamed and the instruction is not enqueued, so that it does not waste queue slots. However, if the instruction about to be queued is stomped, it is allowed to be queued, as the renamer has already assigned registers and it is difficult to undo the assignment. So, things are left as is in that case, but the instruction is marked as a copy target instruction.

If there was a cache miss at the fetch stage, it must percolate down to the subsequent stages as the pipeline is advanced. A fetch from micro-code is not considered to be a miss even if there is a cache miss.

Signal Naming Conventions

Many signals are named depending on the pipeline stage.

| Stage Output | Signal Postfix |
| --- | --- |
| Fetch | \_f |
| Extract | \_x |
| Decode | \_d |
| Rename | \_r |
| Enqueue | \_q |

Micro-code

Micro-code instructions short circuit the first two pipeline stages, fetch and align, as these stages are not required for micro-code.

Fetch continues but instructions are sourced from the micro-code store.

The fetch address continues to increment. At the end of the micro-code function a branch is made back to the next instruction after the macro-instruction that triggered the micro-code.